require(knitr)
## Loading required package: knitr
## Warning: package 'knitr' was built under R version 3.1.3
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.1.3
opts_chunk$set(echo=TRUE)
My notes from the Statistical Inference course in Coursera's Data Science Specialization.
Statistical Inference is the process of generating conclusions about a population from a noisy sample.
Given a random experiment (say, rolling a die), probability is a population quantity that summarizes the randomness.
If A and B cannot both occur: P(A U B) = P(A) + P(B)
A random variable is a numerical outcome of an experiment. A random variable can be discrete or continuous.
A PMF (Probability Mass Function) is the function that returns the probability that a discrete random variable takes a particular value.
For the flip of a coin, X = 0 represents tails and X = 1 represents heads:
p(x) = (1/2)^x (1/2)^(1-x) for x = 0,1 =>
p(0) = (1/2)^0 (1/2)^1 = 1/2
p(1) = (1/2)^1 (1/2)^0 = 1/2
But, this is a fair coin. For an unfair coin:
p(x) = D^x (1 - D)^(1-x) for x= 0,1 =>
p(0) = D^0 (1 - D)^(1 - 0) = (1 - D)
p(1) = D^1 (1 - D)^(1 - 1) = D
So, when the coin is fair, D = 0.5 (1/2)
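The PMF above is easy to check in R; the value D = 0.3 below is just an illustrative assumption for an unfair coin, and `dbinom` with `size = 1` gives the same Bernoulli mass function:

```r
D <- 0.3  # hypothetical probability of heads (illustrative assumption)
p <- function(x) D^x * (1 - D)^(1 - x)
p(0)  # 0.7
p(1)  # 0.3
# dbinom with size = 1 is the Bernoulli PMF
dbinom(0, size = 1, prob = D)  # 0.7
```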
A PDF (Probability Density Function) is a function associated with a continuous random variable. It must comply with the following rules: it must be nonnegative everywhere, and the total area under it must equal one.
x <- c(-0.5, 0, 1, 1, 1.5)
y <- c(0,0,2,0,0)
plot(x, y, lwd=3, frame=FALSE, type="l")
# This is actually the Beta Probability Density Function. In R we have the function pbeta(q, shape1, shape2)
# What is the probability that 75% or less of this population gets drawn?
pbeta(0.75, 2, 1) # 2 and 1 are the shape parameters that match the triangle above
## [1] 0.5625
# so the probability is 56.25%
We can see that this is a valid PDF: the area below the plot is equal to one, (b × h)/2 = (1 × 2)/2 = 1.
The CDF (Cumulative Distribution Function) of a random variable, X, returns the probability that the random variable is less than or equal to x:
F(x) = P(X <= x)
The Survival function of a random variable X is defined as the probability that the random variable is greater than x:
S(x) = P(X > x)
So, notice that:
S(x) = 1 - F(x)
F(x) = P(X <= x) = (1/2)(Base × Height) = (1/2)(x)(2x) = x^2
Then S(x) = 1 - x^2
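For the triangle density above (which is a Beta(2, 1)), both F(x) and S(x) can be checked directly in R:

```r
x <- 0.75
x^2                                # F(x) = x^2 for the triangle density: 0.5625
pbeta(x, 2, 1)                     # same value from R's Beta CDF
pbeta(x, 2, 1, lower.tail = FALSE) # S(x) = 1 - x^2 = 0.4375
```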
The alpha-th quantile of a distribution with CDF F is the point \(x_{\alpha}\) such that:
F(x_alpha) = alpha
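For example, the median (the 0.5 quantile) of the triangle density solves F(x) = x^2 = 0.5, which R's Beta quantile function confirms:

```r
sqrt(0.5)        # solving x^2 = 0.5 analytically: about 0.7071
qbeta(0.5, 2, 1) # the 0.5 quantile of Beta(2, 1): same value
```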
The conditional probability of A, given that B occurs, is:
P(A|B) = P(A ∩ B) / P(B)
If A and B are independent, then P(A|B) = P(A)
Event A is independent of event B if:
P(A|B) = P(A) where P(B) > 0
and also P(A ∩ B) = P(A) × P(B)
So, you cannot multiply probabilities unless you know the events are independent.
Random variables are said to be IID (Independent and Identically Distributed) when they are mutually independent and drawn from the same distribution.
Expected values are values that characterize a population.
Our sample expected values (the sample mean and variance) will estimate the population versions.
E[X] = Sum_x (x p(x))
E[X] is the center of mass of a collection of locations and weights {x, p(x)}
Suppose that a die is rolled and X is the number faced up. What is the Expected Value of X?
E[X] = 1x1/6 + 2x1/6 + 3x1/6 + 4x1/6 + 5x1/6 + 6x1/6 = 3.5
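That calculation can be reproduced in R:

```r
x <- 1:6          # die faces
p <- rep(1/6, 6)  # fair-die probabilities
sum(x * p)        # expected value: 3.5
```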
The standard deviation is the square root of the variance.
In the case of the flip of a coin, the variance is:
V(X) = p(1 - p), where p is the probability of one of the faces of the coin
### Sample variance
The sample variance is:
\[S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}\]
As you collect more and more data, the distribution of the sample will be more concentrated around the population variance it is trying to estimate.
And the square root of the sample variance is the sample standard deviation.
In a simulation, I draw a sample of size n and compute its variance; repeating this thousands of times yields a distribution of sample variances that is centered around the population variance. The larger the sample size n (the more die rolls per experiment), the more concentrated that distribution is around the population variance.
The following example repeats, a thousand times each, the experiment of rolling 10, 20 and 30 dice. The distributions in the picture are the distributions of the sample variance for samples of 10, 20 and 30 die rolls. Notice that the larger the sample, the more concentrated the resulting distribution:
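A minimal version of that simulation (my own sketch of the experiment described above) looks like this:

```r
set.seed(1)
nosim <- 1000
# sample variance of n die rolls, repeated nosim times
simVars <- function(n) {
  apply(matrix(sample(1:6, nosim * n, replace = TRUE), nosim), 1, var)
}
# the population variance of a die roll is E[X^2] - E[X]^2 = 91/6 - 3.5^2, about 2.92
sapply(c(10, 20, 30), function(n) sd(simVars(n)))  # the spread shrinks as n grows
```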
Remember that the average (the mean) of a sample of variables is itself a random variable with its own distribution. The expected value of the sample mean equals the mean of the original population:
E[X̄] = μ (X̄ is the sample mean, μ is the mean of the total population)
The square root of the variance of the sample mean is its standard deviation; we call it the standard error: \(\sigma/\sqrt{n}\).
The Bernoulli distribution arises as the result of a binary outcome, e.g., the flip of a coin, where p is the probability of one of the faces of the coin:
The binomial PMF is \[P(X = x) = \binom{n}{x} p^{x}(1-p)^{n-x}\] where \(\binom{n}{x}\) reads "n choose x" and counts the number of ways of selecting x items out of n, without replacement and disregarding the order of the items.
Imagine a couple that has 8 children, 7 of which are girls and none are twins. If the probability of giving birth to a boy or a girl is 0.5 (p = 0.5), then the probability of having 7 or more girls out of 8 children is:
\[\binom{8}{7}0.5^7(1-0.5)^{8-7} + \binom{8}{8}0.5^8(1-0.5)^{8-8}\]
In R, this is the code:
choose(8,7)*0.5^7*(1-0.5)^(8-7) + choose(8,8)*.5^8
## [1] 0.03515625
# R also has a built-in binomial distribution function, pbinom.
# With lower.tail = FALSE it returns P(X > 6) = P(X >= 7):
pbinom(6, size=8, prob=0.5, lower.tail = FALSE)
## [1] 0.03515625
In any normal distribution, the area below the density curve between one and minus one standard deviation covers about 68% of the distribution.
Then the area between -2 and +2 standard deviation is about 95% of the density:
And then, the area between -3 and +3 sd is about 99% of the density:
If X is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), we can convert it to the Standard Normal Distribution (which we call Z) the following way:
\[Z = \frac{X - \mu}{\sigma}\]
So, 68%, 95%, and 99% of a normal density lies within 1, 2 and 3 standard deviations from the mean, respectively.
Similarly, we should remember the following percentiles of the normal distribution:
In general, a typical calculation to remember is:
What is the probability that a \(N(\mu, \sigma^2)\) RV is larger than x?
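A quick sketch of that calculation, using illustrative values (μ = 30, σ = 10, x = 40 are my own assumptions, not numbers from the notes):

```r
mu <- 30; sigma <- 10; x <- 40  # illustrative values
pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)  # P(X > x), about 0.159
# equivalently, standardize first and use the standard normal
pnorm((x - mu) / sigma, lower.tail = FALSE)
```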
Some uses of the Poisson Distribution:
Note that the mean and the variance of the Poisson distribution are both equal to \(\lambda\).
Another use of the Poisson Distribution is to model rates:
The number of people that show up at a bus stop is Poisson with a mean (\(\lambda\)) of 2.5 per hour.
If watching the bus stop for 4 hours, what is the probability that 3 or fewer people show up for the whole time?
ppois(3, lambda = 2.5 * 4)
## [1] 0.01033605
The Poisson distribution with \(\lambda = np\) approximates the binomial when n is large and p is small. Example: we flip a coin with success probability 0.01 five hundred times (so n is large and p is small):
pbinom(2, size = 500, prob = 0.01)
## [1] 0.1233858
ppois(2, lambda = 500 * 0.01)
## [1] 0.124652
The Law of Large Numbers says that as the sample grows, its mean gets closer and closer to the mean of the whole population.
The CLT states that the distribution of averages of iid variables (properly normalized) approaches a standard normal as the sample size increases.
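A small simulation (my own sketch) illustrates the CLT with die rolls: standardized averages of n rolls should look approximately N(0, 1):

```r
set.seed(1)
nosim <- 1000; n <- 40
means <- apply(matrix(sample(1:6, nosim * n, replace = TRUE), nosim), 1, mean)
# standardize with the known population mean (3.5) and variance (35/12)
z <- (means - 3.5) / sqrt((35/12) / n)
c(mean(z), sd(z))  # close to 0 and 1 respectively
```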
require(UsingR)
## Loading required package: UsingR
## Warning: package 'UsingR' was built under R version 3.1.3
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.1.3
## Loading required package: HistData
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 3.1.3
## Loading required package: grid
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.1.3
## Loading required package: survival
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 3.1.3
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
##
##
## Attaching package: 'UsingR'
##
## The following object is masked from 'package:survival':
##
## cancer
##
## The following object is masked from 'package:ggplot2':
##
## movies
data(father.son)
x <- father.son$sheight
round((mean(x) + c(-1,1) * qnorm(0.975) * sd(x)/sqrt(length(x)))/12,3)
## [1] 5.710 5.738
# if 56 out of 100 people want to vote for a candidate, what is the 95% confidence interval?
0.56 + c(-1,1) * qnorm(0.975) * sqrt(0.56 * 0.44/100)
## [1] 0.4627099 0.6572901
# using the built in R function
binom.test(56,100)$conf.int
## [1] 0.4571875 0.6591640
## attr(,"conf.level")
## [1] 0.95
n <- 20
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage <- sapply(pvals, function(p){
phats <- rbinom(nosim, prob = p, size = n)/n
ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
mean(ll < p & ul > p)
})
plot(pvals, coverage, type="l")
abline(h=0.95)
Confidence interval for the mean: one can get a confidence interval for the mean by taking the sample mean and adding and subtracting the relevant normal quantile times the standard error. Adding and subtracting 2 SEs works for 95% intervals.
From this Wikipedia article, I learned about the actual interpretation of the Confidence Interval.
For example, for the commonly used 95% Confidence Interval, it is wrong to say that there is a 95% probability that the population mean lies within the particular interval we calculated.
Instead, the 95% means that if we repeat the calculation of the CI for many samples, the population \(\mu\) will be contained within the CI 95% of the time. That 95% is the Confidence Level.
If \(\bar{X}\) is the sample mean and the sample means are normally distributed, the 95% CI is calculated as:
\[\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}}\]
And if X is binomial (e.g., coin flips, yes/no votes), where p is the population probability of one of the options and \(\hat{p}\) is the sample proportion, then the 95% CI can be closely approximated by:
\[\hat{p} \pm \frac{1}{\sqrt{n}}\]
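As a quick sketch, reusing the poll numbers from the earlier chunk, the ±1/√n shortcut lands close to the Wald interval:

```r
phat <- 0.56; n <- 100
phat + c(-1, 1) / sqrt(n)                                    # quick CI: 0.46 to 0.66
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n) # Wald CI: about 0.46 to 0.66
```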
If you have to choose between using the Z interval or the T interval, always use the T interval, because as we collect more data the T interval approaches the Z interval:
\[Est \pm TQ \times SE_{Est}\]
The T CI is based on the T distribution, which is not normal. When the sample size n is small and you use the standard normal, you get CIs that are too narrow, so for small values of n you use the T CI. As n gets larger this distinction becomes irrelevant:
\[\frac{\bar{X} - \mu}{\frac{S}{\sqrt{n}}}\]
This statistic follows Gosset's t distribution with n-1 degrees of freedom.
The T Interval is calculated as follows:
\[\bar{X} \pm t_{n-1}\frac{S}{\sqrt{n}}\] where \(t_{n-1}\) is the relevant T quantile
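To see why the distinction fades as n grows, compare t quantiles with the normal quantile:

```r
qnorm(0.975)  # 1.959964
# t quantiles are wider for small df, then converge to the normal quantile
sapply(c(5, 15, 50, 500), function(df) qt(0.975, df))
```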
Some notes about T intervals:
Let's use the sleep dataset in R. This dataset shows the increase in sleeping hours of 10 individuals after taking a sleeping medication. We are going to treat groups 1 and 2 as the same group of individuals, before and after taking the medication.
data(sleep)
head(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
g <- ggplot(sleep, aes(x = group, y = extra, group = factor(ID)))
g <- g + geom_line(size = 1, aes(colour = ID)) + geom_point(size =10, pch = 21, fill = "salmon", alpha = .5)
g
Now let's calculate the mean and the sd of the difference between the two groups. We are treating the data as paired, as if groups 1 and 2 were the same subjects before and after taking the sleep medication.
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
difference <- g2 - g1
mn <- mean(difference); s <- sd(difference); n <- 10
Now let’s calculate the t interval using the formula and also using the build in R function:
#calculating the interval with the formula
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
## [1] 0.7001142 2.4598858
#calculating the interval with the R function
t.test(difference)
##
## One Sample t-test
##
## data: difference
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.7001142 2.4598858
## sample estimates:
## mean of x
## 1.58
t.test(g2, g1, paired = TRUE)
##
## Paired t-test
##
## data: g2 and g1
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7001142 2.4598858
## sample estimates:
## mean of the differences
## 1.58
#other way to use the R function
t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)
##
## Paired t-test
##
## data: extra by I(relevel(group, 2))
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7001142 2.4598858
## sample estimates:
## mean of the differences
## 1.58
So the t interval we got from the last code chunk is 0.7 to 2.46
Since we calculated a 95% t interval, if we repeated this experiment many times, the interval computed each time would contain the population mean we are estimating about 95% of the time.
When we need to calculate the CI for two independent groups (for example, measuring the blood pressure of two different random groups of subjects, one taking a medication and the other a placebo), the groups are independent, so we can't use the paired t interval. The interval calculation is as follows:
Therefore a \((1 - \alpha)\times 100\%\) confidence interval for \(\mu_y - \mu_x\) is \[ \bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} \] The pooled variance estimator is \[S_p^2 = {(n_x - 1) S_x^2 + (n_y - 1) S_y^2}/(n_x + n_y - 2)\] Remember this interval is assuming a constant variance across the two groups
With two independent groups (x and y), the degrees of freedom are calculated as:
\[n_{x} + n_{y} - 2\]
Remember that if the variances are not about the same between the two groups, this interval won't give you proper coverage.
Suppose you want to see the effect of taking an oral contraceptive on blood pressure. We have 8 oral contraceptive users vs 21 controls:
- \(\bar{X}_{OC} = 132.86\) mmHg with \(S_{OC} = 15.34\) (S is the SD)
- \(\bar{X}_{C} = 127.44\) mmHg with \(S_{C} = 18.23\)
In order to calculate the interval, let’s first calculate the Pooled Variance:
sp <- sqrt(((8-1)*15.34^2 + (21-1)*18.23^2)/(8+21-2))
132.86 - 127.44 + c(-1,1) * qt(0.975,8+21-2)*sp*(1/8 + 1/21)^0.5
## [1] -9.521097 20.361097
So we got (-9.52, 20.36); this is the 95% confidence interval.
In a t interval we are getting the CI of the mean of the difference between the two groups. In the last example, since zero is inside the interval, we cannot rule out that there is no difference between the average blood pressure of the oral contraceptive users and the control group.
In R, whenever you have or suspect unequal variances, you might want to use the formula:
t.test(..., var.equal = FALSE)
The null hypothesis is assumed true and statistical evidence is required to reject it in favor of a research or alternative hypothesis
There are four possible outcomes of our statistical decision process:
| Truth | Decision | Result |
|-------|----------|--------|
| H0 | H0 | Correctly accept null |
| H0 | Ha | Type I error |
| Ha | Ha | Correctly reject null |
| Ha | H0 | Type II error |
So, a reasonable strategy would reject the null hypothesis if \(\bar{X}\) was larger than some constant, C. Typically, C is chosen so that the probability of a Type I error, \(\alpha\), is 0.05
\(\alpha\) = Type I error rate = probability of rejecting the null hypothesis when, in fact, the null hypothesis is correct.
Example: \(\mu_{0} = 30\), sd = 10, n = 100. The standard error of the mean is:
\[\frac{10}{\sqrt{100}} = 1 \]
Under \(H_{0}\), \(\bar{X} \sim N(30,1)\)
We want to choose C so that P(\(\bar{X}\) > C; \(H_{0}\)) is 5%
The 95th percentile of a normal distribution is 1.645 standard deviations from the mean, so:
C = 30 + 1 × 1.645 = 31.645
So, the rule "Reject \(H_{0}\) when \(\bar{X} \geq 31.645\)" has the property that the probability of rejection is 5% when \(H_{0}\) is true (for the \(\mu_{0}\), \(\sigma\) and n given).
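The constant C from this example can be computed directly (a sketch using the same \(\mu_{0}\) = 30, \(\sigma\) = 10, n = 100):

```r
mu0 <- 30; sigma <- 10; n <- 100; alpha <- 0.05
# reject H0 when the sample mean exceeds C
C <- mu0 + qnorm(1 - alpha) * sigma / sqrt(n)
C  # 31.64485
```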
We don't usually convert the constant C back to the units of the original data. Instead, we standardize the sample mean and see how many standard errors it is from the hypothesized mean. In the following example the sample mean is \(\bar{X} = 32\), so:
\[\frac{32 - 30}{\frac{10}{\sqrt{100}}}=2\]
And this is greater than 1.645. Since we got a test statistic of 2, and we know that the chance of that occurring when \(H_{0}\) is true is less than 5%, we should reject the null hypothesis (\(H_{0}\)).
Or: we reject \(H_{0}\) whenever \[\frac{\sqrt{n}(\bar{X}-\mu_{0})}{s} > Z_{1-\alpha}\] where s is the standard deviation of the population.
Consider that n = 16 (rather than 100)
The statistic
\[\frac{\bar{X} - \mu_{0}}{\frac{S}{\sqrt{n}}}\]
follows a T distribution with n - 1 = 15 degrees of freedom under \(H_{0}\).
Under \(H_{0}\), the probability that the statistic is larger than the 95th percentile of the T distribution is 5%.
The 95th percentile of the T distribution with 15 df is 1.7531
qt(0.95, 15)
## [1] 1.75305
So, for our example of the mean of the sample being 32, our test statistic is:
\[\frac{32 - 30}{\frac{10}{\sqrt{16}}} = 0.8\]
As the 95th percentile of the T distribution with 15 df is 1.7531, we can’t reject the H0 in this example (because 0.8 < 1.7531)
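The same comparison in R, using the numbers above:

```r
ts <- (32 - 30) / (10 / sqrt(16))  # the test statistic: 0.8
ts > qt(0.95, 15)                  # FALSE: we fail to reject H0
```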
Suppose that you get a T statistic of 2.5 for 15 df testing \(H_{0} : \mu = \mu_{0}\) vs \(H_{a} : \mu > \mu_{0}\)
pt(2.5, 15, lower.tail = FALSE)
## [1] 0.0122529
So, the probability is about 1.2%. Either \(H_{0}\) is true and we observed a quite rare T statistic, or \(H_{0}\) is false.
Finally, the way you use the P-value: if the P-value is less than \(\alpha\), you reject the null hypothesis.